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Abstract. With the development of connected niters for the last decade, 
many algorithms have been proposed to compute the max-tree. Max-tree 
allows to compute the most advanced connected operators in a simple 
way. However, no fair comparison of algorithms has been proposed yet 
and the choice of an algorithm over an other depends on many parame- 
ters. Since the need of fast algorithms is obvious for production code, we 
present an in depth comparison of five algorithms and some variations 
of them in a unique framework. Finally, a decision tree will be proposed 
to help user choose the right algorithm with respect to their data. 

1 Introduction 

In mathematical morphology, connected filters are those that modify an original 
signal by only removing connected components, hence those that preserve image 
contours. At the early stage, they were mostly used for image filtering [P71 H4] . 
Breaking came from max and min-tree as hierarchical representations of con- 
nected components and from an efficient algorithm able to compute them [TB"] . 
Since then, usage of these trees has soared for more advanced forms of filtering: 
based on attributes @] , using new filtering strategies [131 US] , allowing new types 
of connectivity. They are also a base for other image representations, in [9] a tree 
of shapes is computed from a merge of min and max trees, in |21j a component 
tree is computed over the attributes values of the max-tree. Max-trees have been 
used in many applications: computer vision through motion extraction fea- 
tures extraction with MSER [5], segmentation, 3D visualization 7.. With the 
increase of applications comes an increase of data type to process: 12-bit image 
in medical imagery [7], 16-bit or float image in astronomical imagery [2], and 
even multivariate data with special ordering relation [12]. With the improvement 
of optical sensors, images are getting bigger (so do image data sets) which urge 
the need of fast algorithms. Many algorithms have been proposed to compute the 
max-tree efficiently but only partial comparisons have been proposed. Moreover, 
some of them are dedicated to particular task (e.g., filtering) and are unusable 



for other purpose. We provide in this paper a full and fair comparison of state-of- 
the-art max-tree algorithms in a unique framework i.e. same architecture, same 
language (C++) and same outputs. 

2 A tour of max-tree: definition, representation and 
algorithms 

2.1 Basic notions for max-tree 

Let ima : fl — > V an image on regular domain fl, having values on totally 
preordered set (V, <) and let A/" a neighborhood on fl. Let A £ V, we note 
[ima < A] the set {p € fl,ima{p) < A}. Let X C fl, we note C'C(X) C V(fl) 
the set of connected components of X w.r.t the neighborhood AT. CC([ima = 
A]), A € V} are level components and {CC([ima > A]), A € y} (resp. <) is the 
set of upper components (resp. lower components). The latter endowed with the 
inclusion relation form a tree called the max-tree (resp. min-tree). The peak 
component of p at level A noted P A is the upper component X E CC ( [ima > A] ) 
such that p£l. 

2.2 Max-tree representation 

Berger et al. [2] rely on a simple and effective encoding of component-trees using 
an image that stores the parent relationship that exists between components. 
A connected component is represented by a single point called the canonical 
element [21 [10] or level root. Let two points p,q 6 f2, and p r the root of the 
tree, we say that p is canonical if p = p r or ima(parent(p)) < ima(p). A parent 
image shall verify the following three properties: 1) parent{p) = p =>■ p = p r 
- the root points to itself and it is the only point verifying this property - 2) 
ima(jparent{p)) < ima(p) and 3) parent(p) is canonical. 
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(b) Max-tree of ima 
Fig. 1: Representation of a max-tree with a parent image and an array. 



Furthermore, this representation requires an extra vector S : N — > fl of 
points that orders the nodes downward. Thus, S must verifies Vi,j £ PI i < 



j =>■ S[j] =/= parent (S[i]). Ordering S vector allows to traverse the tree upward 
or downward without storing children of each node. |Figure l] shows an example 
of such a representation of a max-tree. This representation only requires 2.n.I 
bytes memory space where n is the number of pixels and / the size in bytes of an 
integer (points are positive offsets in a pixel buffer) . The algorithms we compare 
have all been modified to output such a tree encoding. 

2.3 Attribute filtering and reconstruction 

A classical approach for object detection and filtering is to compute some features 
called attributes on max-tree nodes. An usual attribute is the number of pixels 
in components. Followed by a filtering, it leads to the well-known area opening. 
More advanced attributes have been used like elongation, moment of inertia [IB] 
or even mumford-shah like energy |2f j . 

Many max-tree algorithms only construct the parent image but do not care 
about S construction [191 15]. they output incomplete information. We require 
that max-tree algorithms give a "usable" tree, i.e., a tree that can be traversed 
upwards and downwards, that allows attribute computation and non-trivial fil- 
tering. The rational behind this requirement is that, for some applications, fil- 
tering parameters are not known yet when building the tree (e.g., for interactive 
visualization [7]). In the algorithms we compare in this paper, no attribute com- 
putation nor filtering are performed during tree construction for clarity reasons; 
yet they can be augmented to compute attribute and filtering at the same time. 
Algorithm [T] provides an implementation of attribute computation and direct- 
filtering with the representation. / : Q x V — > A is an application that projects 
a pixel p and its value ima(p) in the attribute space A. + ■ A x A — > A is 
an associative operator used to merge attributes of different nodes. COMPUTE- 
ATTPJBUTE starts with computing attribute of each singleton node and merge 
them from leaves toward root. Note that this simple code relies on the fact that a 
node receives all information from its children before passing its attribute to the 
parent. Without any ordering on S, it would not have been possible, direct- 
filter is an implementation of direct filtering as explained in [13] that keeps 
all nodes passing a criterion A and lowers nodes that fails to the last ancestor 
"alive" . This implementation has to be compared with the one in Q35] that only 
uses parent. This one is shorter, faster and clearer above all. 

3 Max-tree algorithms 

Max-tree algorithms can be classified in three classes: 

Immersion algorithms. It starts with building N disjoints singleton for each 
pixel and sort them according to their gray value. Then, disjoint sets merge 
to form a tree using Tarjan's union-find algorithm [15] . 

Flooding algorithms. A first scan allows to retrieve the root which is a pixel 
at lowest level in the image. Then, it performs a propagation by flooding 
firsts the neighbor at highest level i.e. a depth first propagation. [I3" ] 120 ] . 



Algorithm 1 Computation of attributes and filtering. 



function 

compute-attribute^, parent, ima 
Proot «- 5[0] 
for all p 6 S do 

attr(p) «— f(p,ima(p)) 
for all p G S backward, p 7^ p r0 ot do 

5 <s— parent(p) 

attr(q) «— attr (q)+ attr (p) 
return attr 



function 

direct-filter(5, parent, ima, attr) 

Proot <~ 5[0] 

if attr(proot) < A then out(p r oot) = 
else out(proot) = ima(p roo t) 
for all p £ S forward do 
g parent(p) 
if ima(q) — ima(p) then 

out(p) = out(q) > p not canonical 
else if attr(p) < A then 

out(p) <s— out(q) > Criterion failed 
else 

out(p) <s— ima(p) > Criterion pass 
return out 



Merge-based algorithms. They divide an image in blocks and compute the 
max-tree on each sub-image using another max-tree algorithm. Sub max- 
trees are then merged to form the tree of the whole image. Those algorithms 
are well-suited for parallelism using a map-reduce approach |19j . When blocks 
are image lines, a dedicated fD max-tree algorithm can be used [5J|H]- 



3.1 Immersion algorithms 

Bcrgcr et al. [5] and Najman and Couprie [TU] proposed two algorithms based 
on Tarjan's union-find. They consist in tracking disjoints connected components 
and merge them in bottom up fashion. First, pixels are sorted in an array S 
where each pixel p represent the singleton set {p}. Then, we process pixels of S 
in backward order. When a pixel p is processed, it looks for already processed 
neighbor (Af(p)) and merges with neighboring connected components to form 
a new connected set rooted in p. The merging process consists in updating the 
parent pointer of neighboring component roots toward p. Thus, the union-find 
relies on three processes: 

— make-set (parent , x) that builds the singleton set {x}, 

— find-root (parent , x) that finds the root of the component that contains 

x, 

— merge-set (parent , x, y) that merges components rooted in x and y and 
set x as the new root. 

Based on the above functions, a simple max-tree algorithm is given below: 
procedure MAXTREE(ima) 
S 4— sorts pixels increasing 
for all p e S backward do 
make-set (parent, p) 
for all n € Af p processed do 



r <— find-root(pareni, n) 
if r 7^ p then 

merge-set(parent, p, r) 

find-root is a 0{n) function that makes the above procedure a 0(n 2 ) al- 
gorithm. Tarjan |15j discussed two important optimizations for his algorithm to 
avoid a quadratic complexity: 

Root path compression. When parent is traversed to get the root of the compo- 
nent, points of the path used to find the root collapse to the actual root the 
component. However, path compression should not be applied on parent im- 
age because it removes the hierarchical structure of the tree. As consequence, 
we apply path compression on an intermediate image zpar that stores the 
root of disjoints components. Path compression bounds union-find complex- 
ity to 0(n log n) and has been applied in [2J and [10] . 

Union-by-rank. When merging two components A and B, we have to select 
one the roots to represent the newly created component. If A has a rank 
greater than B then root a is selected as the new root, roots otherwise. When 
rank matches the depth of trees, it enables tree balancing and guaranties a 
0(n log n) complexity for union-find. When used with path compression, it 
allows to compute the max-tree in quasi-linear time (0(n.a(n)) where a(n) 
is the inverse of Ackermann function which is very low-growing) . Union- by- 
rank has been applied in [10] . 

Note that parent and zpar encode two different things, parent encodes the 
max tree while zpar tracks disjoints set of points and also use a tree. Thus, 
union-by-rank and root path compression shall always be applied on zpar but 
never on parent. 

Algorithm [2] is the union-find based max-tree algorithm as proposed by 
Berger et al. [2]. It starts with sorting pixels that can be done with a count- 
ing sort algorithm for low-quantized data or with a radix sort-based algorithm 
for high quantized data[T|.Then it annotates all pixels as unprocessed with — 1 
(in standard implementations pixel are positive offsets in a pixel buffer). Later 
in the algorithm, when a pixel p is processed it becomes the root of the com- 
ponent i.e parent(p) = p with p ^= —1, thus testing parent(p) —1 stands for 
is p already processed. Since S is processed in reverse order and merge-set sets 
the root of the tree to the current pixel p (parent(r) <— p), it ensures that the 
parent p will be seen before its child r when traversing S in the direct order. 

Union-by-rank Algorithm [3] is similar to algorithm [2] but augmented with 
union-by-rank. It first introduces a new image rank. The make-set step creates 
a tree with a single node, thus with a rank set to 0. The rank image is then 
used when merging two connected set in zpar. Let z p the root of the connected 
component of p, and z n the root of connected component of n € Af(p). When 
merging two components, we have to decide which of z p or z n becomes the new 
root w.r.t their rank. If rank(z p ) < rank(z n ), z p becomes the root, z n otherwise. 
If both z v and z n have the same rank then we can choose either z v or z n as the 



Algorithm 2 Union find without union- by-rank 



function FlND-ROOT(par, p) 

if par (p) 7^ p then par(p) 4— FlND-ROOT(par,par(p)) 
return par(p) 
function Maxtree (ima) 

for all p do parent(p) < 1 

S <— sorts pixels increasing 
for all p € S backward do 

parent(p) 4— p; zpar(p) 4— p > make-set 

for all n g N p such that parent(n) 7^ — 1 do 
r 4- FIND-ROOT(zpar, n) 
if r 7^ p then 

zpar(r) <— p; parent(r) <— p > merge-set 

CANONlZE(parent, 5) 
return (parent, S) 



new root, but the rank should be incremented by one. On the other hand, the 
relation parent is unaffected by the union-by-rank, p becomes the new root 
whatever the rank of z p and z n . Whereas without balancing the root of any 
point p in zpar matches the root of p in parent, this is not the case anymore. For 
every connected components we have to keep a connection between the root of 
the component in zpar and the root of max-tree in parent. Thus, we introduce 
an new image repr that keeps this connection updated. 

The union-by-rank technique and structure update are illustrated in |Figure 2| 
The algorithm has been running until processing E at level 12, the first neighbor 
B has already been treated and neighbors D and F are skipped because not yet 
processed. Thus, the algorithm is going to process the last neighbor H . z p is the 
root of p in zpar and we retrieve the root z n of n with find-root procedure. 
Using repr mapping, we look up the corresponding point r of z n in parent. The 
tree rooted in r is then merged to the tree rooted in p (parent merge). Back in 
zpar, components rooted in z p and z n merge. Since they have the same rank, we 
choose arbitrary z p to be the new root. 

Algorithm [3] is slightly different from the one of Najman and Couprie [TU] . 
They use two union-find structure, one to build the tree, the other to handle 
flat zones. In their paper, lowernode [z p ] is an array that maps the root of a 
component z p in zpar to a point of current level component in parent (just like 
repr (z p ) in our algorithm). Thus, they apply a second union-find to retrieve 
the canonical. This extra union-find can be avoided because lowernode [x] is 
already a canonical element, thus findoot on lowernode(z p ) is useless and so 
does parent balancing on flat zones. 



Canonization Both algorithms call the CANONiZE(p)rocedure to ensure that 
any node's parent is a canonical node. In algorithm |4j canonical property is 
broadcast downward. S is traversed in direct order such that when process- 



Algorithm 3 Union find with union- by-rank 



procedure MAXTREE(ima) 

for all p do parent(p) < 1 

S -s— sorts pixels increasing 
for all p € S backward do 

parent(p) <— p; zpar(p) p > make-set 

rank(p) <s— 0; repr(p) <s— p 

z p <- p 

for all n £ N p such that parentin) 7^ — 1 do 
z n 4— FlND-ROOT(zpar, n) 
if z„ 7^ 2 P then 

parent(repr(z„)) <s— p 

if rank(zp) < rank(z„) then swap(z p , z n ) 

zpar(z n ) z p > merge-set 

if rank(zp) — rank(z n ) then 
rank(z p ) <— rank(z p ) + 1 
CANONlZE(parent, S) 
return (parent, S) 



ing a pixel p, its parent q has the canonical property that is parent(q) is a 
canonical element. Hence, if q and parent(q) belongs to the same node i.e 
ima(q) = ima(parent(q)) , the parent of p is set to the component's canonical 
element: parent(q). 



Algorithm 4 Canonization algorithm 



procedure Canonize (^ma, parent, S) 
for all p in S forward do 
q <— parent(p) 

if ima(q) = ima(parent(q)) then 
parent(p) parent(q) 



Level compression Union-by-rank provides time complexity guaranties at the 
price of extra memory requirement. When dealing with huge images this results 
in a significant drawback (e.g. RAM overflow. . . ). Since the last point processed 
always becomes the root, union- find without rank technique tends to create 
degenerated tree in flat zones. Level compression avoids this behavior by a special 
handling of flat zones. In algorithm [5j p is the point in process at level A = 
ima(p), n a neighbor of p already processed, z p the root of P p (at first z p — p), 
z n the root of P„ - We suppose ima(z p ) = ima(z n ), thus z p and z n belong to the 
same node and we can choose any of them as a canonical element. Normally p 
should become the root with child z n but level compression inverts the relation, 
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Fig. 2: Union- by-rank, (a) State of the algorithm before processing the neighbor 
H from E. (b) State of the algorithm after processing. 



z n is kept as the root and z p becomes a child. Since parent may be inverted, S 
array is not valid anymore. Hence S is reconstructed, as soon as a point p gets 
attached to a root node, p will be not be processed anymore so its inserted in 
back of S. At the end S only misses the tree root which is parent [S^O]]. 



3.2 Flooding algorithms 

Salembier et al. [13] proposed the first efficient algorithm to compute the max- 
tree. A propagation starts from the root that is the pixel at lowest level l m in- 
Pixels in the propagation front are stored in a hierarchical queue that allows a 
direct access to pixels at a given level in the queue. The f lood(A, r) procedure 
(see algorithm [6]) is in charge of flooding the peak component and building 
the corresponding sub max-tree rooted in r. It proceeds as follows: first pixels at 
level A are retrieved from the queue, their parent pointer is set to the canonical 
element r and their neighbors n are analyzed. If n is not in queue and has not 
yet been processed, then n is pushed in the queue for further process sing and n 
is marked as processed (parent(n) is set to INQUEUE which is any value different 
from -1). If the level I of n is higher than A then n is in the childhood of the 
current node, thus flood is called recursively to flood the peak component P l n 
rooted in n. During the recursive flood, some points can be pushed in queue 
between level A and I. Hence, when flood ends, it returns the level I' of n's 
parent. If I' > A, we need to flood level I' until I' < A i.e until there are no 
more points in the queue above A. Once all pixels at level A have processed, we 
need to retrieve the level Ipar of parent component and attach r to its canonical 
element. A levroot array stores canonical element of each level component and 
-1 if the component is empty. Thus we just have to traverse levroot looking for 
Ipar = max{/i < A, levroot[h] ^ —1} and set the parent of r to levroot[lpar]. 



Algorithm 5 Union find with level compression 



function Maxtree (ima) 

for all p do parent(p) < 1 

S <— sorts pixels increasing 
j = N-l 

for all p € S backward do 

parent(p) 4— p; zpar(p) <— p > make-set 

z P = V 

for all n £ N p such that parentin) 7^ — 1 do 
z n 4- FlND-ROOT(zpar, n) 
if z p 7^ z n then 

if ima(zp) — ima(z n ) then SWAp(()z p , z n ) 

zpar(z n ) 4— z p ; parent(z„) <— z v t> merge-set 

S[j] <- z„; j <- 3 - 1 
S[0] <r- parent[S[0]] 
CANONlZE(parent, S) 
return (parent, S) 



Since the construction of parent is bottom-up, we can safely insert p in front 
of the S array each time parent(p) is set. For a level component, the canonical 
element is the last element inserted ensuring a correct ordering of S. Note that 
the first that gets a the minimum level of the image is not necessary. Instead, we 
could have called flood in Max-tree procedure until the parent level returned 
by the function was -1, i.e the last flood call was processing the root. Anyway, 
this pass has other advantages for optimization that will be discussed in the 
implementation details section. 

Salembier et al. |13j's algorithm was rewritten in a non-recursive implemen- 
tation in Hesselink 0] and later by Nister and Stewenius [TT] and Wilkinson [2D] . 
These algorithms differ in only two points. First, [20 j uses a pass to retrieve the 
root before flooding to mimics the original recursive version while Nister and 
Stewenius [TT] does not. Second, priority queues in [TT] use an unacknowledged 
implementation of heap based on hierarchical queues while in |20j they are im- 
plemented using a standard heap (based on comparisons). The algorithm [7] is 
a code transcription of the method described in Nister and Stewenius [UJ . The 
array levroot in the recursive version is replaced by a stack with the same pur- 
pose: storing the canonical element of level components. The hierarchical queue 
hqueue is replaced by a priority queue pqueue that stores the propagation front. 
The algorithm starts with some initialization and choose a random point p s tart 
as the flooding point. p s tart is enqueued and pushed on levroot as canonical ele- 
ment. During the flooding, the algorithm picks the point p at highest level (with 
the highest priority) in the queue, and the canonical element r of its component 
which is the top of levroot (p is not removed from the queue). Like in the recur- 
sive version, we look for neighbors n of p and enqueue those that have not yet 
been seen. If ima(n) > ima(p), n is pushed on the stack and we immediately 
flood n (a goto that mimics the recursive call). On the other hand, if all neigh- 



Algorithm 6 Salembier et al. [T3] max-tree algorithm 

function flood(A, r) 

while hqueue[X] not empty do 
p <S— POP (hqueue[X\) 
parent(p) <— r 

if p ^ r then insert _front(S, p) 
for all n £ M{p) such that parent(p) = — 1 do 
/ ima(n) 

if /ewroot[/] = — 1 then /ei>root[T] n 

PVSH(hqueue[l],n) 
parent(n) «- INQUEUE 
while I > A do 

I flood(l,levroot[l]) 

> Attach to parent 

iewroot[A] < 1 

Ipar A — 1 

while /par > and levroot[lpar] — — 1 do 

Zpar Zpar — 1 
if Ipar 7^—1 then 

parent(r) ■(— levroot[lpar] 

insert_front(5, r) 
return Ipar 



bors are in the queue or already processed then p is done, it is removed from the 
queue, parent(p) is set its the canonical element r and if r =/= p, p is added to 
5* (we have to ensure that the canonical clement will be inserted last). Once p 
removed from the queue, we have to check if the level component has been fully 
processed in order to attach the canonical element r to its parent. If the next 
pixel q has a different level than p, we call the procedure ProcessStack that 
pops the stack, sets parent relationship between canonical elements and insert 
them in S until the top component has a level no greater than ima(q). If the 
stack top's level matches g's level, q extends the component so no more process 
is needed. On the other hand, if the stack gets empty or the top level is lesser 



Algorithm 6 Salembier maxtree algorithm (continued) 

function MAX-TREE(ima) 

for all h do levroot[h] < 1 

for all p do parent(p) < 1 

Imin muip ima(p) 
Pmin <— arg min p ima(p) 
PVSn(hqueue[l min ], p min ) 
levroot[lmin] p m in 

FLOOD(/ m i n , Pmin) 



than ima{q), then q is pushed on the stack as the canonical element of a new 
component. The algorithm ends when all points in queue have been processed, 
then S only misses the root of the tree which is the single element that remains 
on the stack. 



Algorithm 7 Non-recursive max-tree algorithm [TTJ [2U] 
1: function MAX-TREE(ima) 



2: for all p do parent(p) < 1 

3: Pstart any point in J? 

4: pum(j>queue, p a tart); PVSu(levroot,p s tart) 

5: par ent(p start) «- INQUEUE 

6: loop 

7: p <S— TOP(pqueue); r <S— TOP(levroot) 

8: for all n G N(p) such that parent(p) = —I do 

9: PUSH(pqueue, n) 

10: parent(n) <- INQUEUE 

11: if ima(p) < ima(n) then 

12: PUSH(Zewroot, n) 

13: goto 7 

14: { p is done } 

15: POP(pqueue) 
16: parent(p) ^— r 

17: if p / r then insert _front(5', p) 

18: while pqueue not empty do; 
19: { all points at current level done ? } 

20: q -h- TOP(pqueue) 

21: if ima(q) ^ ima(r) then > Attach r to its parent 

22: PROCESSSTACK(Y, q) 

23: repeat 

24: root <s— POP(Zevroot) 

25: insert_front(S', root) 



Algorithm 7 Non-recursive max-tree algorithm (continued) 



procedure ProcessStack^, q) 
A <— ima(q) 
POP(levroot) 

while levroot not empty and A < ima(TOP(levroot)) do 

insert_front(S', r) 

r parent(r) POP (Zewrooi)) 
if levroot empty or ima(TOP(levroot)) 7^ A then 

PVSn(levroot, q) 
parent(r) TOP(levroot) 
insert_front(S, r) 



3.3 Merge-based algorithms and parallelism 

Merge-based algorithms consist in computing max-tree on sub-parts of images 
and merging back trees to get the max-tree of the whole image. Those algorithms 
are typically well-suited for parallelism since they adopt a map-reduce idiom. 
Computation of sub max-trees (map step), done by any sequential method and 
merge (reduce-step) are executed in parallel by several threads. In order to im- 
prove cache coherence, images should be split in contiguous memory blocks that 
is, splitting along the first dimension if images are row-major. |Figurc 3| shows 
an example of parallel processing using map-reduce idiom. Choosing the right 
number of splits and jobs distribution between threads is a difficult topic that 
depends on the architecture (number of threads available, power frequency of 
each core). If the domain is not split enough (number of chunks no greater than 
number of threads) the parallelism is not maximal, some threads become idle 
once they have done their jobs, or wait for other thread to merge. On the other 
hand, if the number of split gets too large, merging and thread synchronization 
cause significant overheads. Since work balancing and thread management are 
outside the current topic, they are delayed to high level parallelism library such 
as TBB. 
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Fig. 3: Map-reduce idiom for max-tree computation, (a) Sub-domains of ima. 
(b) A possible distribution of jobs by threads, (c) Map-reduce operations. / is 
the map operator, © the merge operator. 



The procedure in charge of merging sub-trees Ti and Tj of two adjacent 
domains _D.j and Dj is given in algorithm [8j For two neighbors p and q in the 
junction of Di, Dj, it connects components of p's branch in Tj to components 
of g's branch in Tj until a common ancestor is found. Let x and y, canonical 
elements of components to merge with ima(x) > ima(y) (x is in the childhood 
to y) and z, canonical element of the parent component of x. If x is the root 
of the sub-tree then it gets attached to y and the procedure ends. Otherwise, 
we traverse up the branch of x to find the component that will be attached to 
y that is the lowest node having a level greater than ima(y). Once found, x 
gets attached to y, and we now have to connect y to it's old parent. Function 
f indrepr(p) is used to get the canonical element of p's component whenever 
the algorithm needs it. 



Algorithm 8 Tree merge algorithm 

function FlNDREPR(par, p) 

if ima(p) 7^ ima(par(p)) then return p 
par(p) <— FlNDREPR(par, par(p)) 
return par(p) 

procedure CONNECT(p,q) 
x FlNDREPR(parent,p) 
y 4— FlNDREPR(pareni, q) 
if ima(x) < ima(y) then SWAp(i, y) 
while x ^ y do 

parent(x) FlNUREPR(parent,parent(x)); 
z -s— parent(x) 
if x = z then 

parent(x) y; y x 
else if ima(z) > ima(y) then 

x <— z 
else 

parent(x) <s— y 
x <- V 
V <r- z 

procedure mergetree(_D;, Dj) 

for all (p, q) £ Di x Dj such that q G Af(p) do 
CONNECT(p,(j) 



Once sub-trees have been computed and merged into a single tree, it does not 
hold canonical property (because non-canonical elements are not updated during 
merge). Also, reduction step does not merge S array corresponding to sub-trees 
(it would imply reordering S which is more costly than just recomputing it at the 
end) . Algorithm [9] performs canonization and reconstructs S array from parent 
image. It uses an auxiliary image dejavu to track nodes that have already been 



> common ancestor found ? 



> x is root 



inserted in S. As opposed to other max-tree algorithms, construction of S and 
processing of nodes are top-down. For any points p, we traverse in a recursive 
way its path to the root to process its ancestors. When the recursive call returns, 
parent{p) is already inserted in S and holds the canonical property, thus we can 
safely insert back p in S and canonize p as in algorithm [4] 



Algorithm 9 Canonization and S computation algorithm 

procedure CanonizeRec(p) 
dejavu(p) = true 
q i— parent(p) 

if not dejavu(q) then > Process parent before p 

CANONIZEREC(q) 

if ima(q) — ima(parent(q) then > Canonize 

parent(p) 4— parent(q) 
InsertBack(5", p) 

for all p do dejavu(p) 4— False 
for all p 6 Q such that not dejavu(p) do 
C anonizeRec (p) 



3.4 Implementation details 

Algorithms have been implemented in pure C++ using STL implementation of 
some basic data structures (heaps, priority queues), Milena image processing 
library to provide fundamental image types and I/O functionality, and Intel 
TBB for parallelism. Specific implementation optimizations are listed below: 

— Sort optimization. Counting sort is used when quantization is lower than 18 
bits. For large integer of q bits, it switches to 2 1 6-based radix sort requiring 
g/16 counting sort. 

— Pre-allocation. Queues and stacks. . . are pre-allocated to avoid dynamic mem- 
ory reallocation. Hierarchical queues are also pre-allocated by computing 
image histogram as a pre-processing. 

— Priority-queues. Heap is implemented with hierarchical queues when quanti- 
zation is less than 18 bits. For large integer it switches to the STL standard 
heap implementation. A y-fast trie can be used for large integer ensuring 



a better complexity (see subsection 3.5 ) but no performance gain has been 
obtained. 

Map-reduce. In parallel version of algorithms, all instructions that deals 
about S construction and parent canonization have been removed since they 
are S is reconstructed from scratch and parent canonized by procedure [9] 



3.5 Complexity analysis 



Let n = H * W with H the image height, W the image width and n the total 
number of pixels. Let fc, the number of values in V. 

— Immersion-based algorithms requires sorting pixels which has a complexity 
of 0(n + k) (k <C n) for small integers (counting sort), O(nloglogn) for large 
integers (hybrid radix sort), and 0(nlog?i) for generic data type with a more 
complicated ordering relation (comparison sort). The union-find is 0(n log n) 
and 0(na(n)) when used with union-by-rank. Q The canonization step is 
linear and does not use extra memory. Memory-wise, sorting may require an 
auxiliary buffer depending on the algorithm and histograms for integer sorts 
so 0(n + k). Union without rank requires a zpar image for path compression 
(&(n)) and the system stack for recursive call in f indroot which is 0(n) 
(f indroot could be non-recursive, but memory space is saved at cost of an 
higher computational time). When union- by-rank is used, it requires two 
extra images (rank and repr) of n pixels each. 

— Flooding-based algorithms require a heap or hierarchical queues to retrieve 
the highest point in the propagation queue. Each point is inserted once and 
removed once (however they may be visited more than once in non-recursive 
versions) thus the complexity is 0(n.p) where p is the cost of pushing or 
popping from the heap. If the heap is encoded with a hierarchical queue 
as in [13] or [11], it uses n + 2.k memory space, insertion is O(l), access 
to the maximum is O(l) but popping is O(k) (in the recursive version, we 
loop on levroot to get the parent level). In practice, in images with small 
integers, gray level difference between neighboring pixels is far to be as large 
as k. With high dynamic image, the heap can be implemented as a y-fast 
trie, which has insertion and deletion in 0(loglog fc) and access to maximum 
element in 0(1). For any other data type V, a "standard" heap based on 
comparisons requires n extra space, allows insertion and deletion in O(logn) 
and has a constant access to maximal element. Those algorithms need also an 
array or a stack levroot to store canonical elements of respective size k and 
n. Salembier's algorithm uses the system stack for a recursion of maximum 
depth fc, hence 0(k). 

— For merge-based algorithm, complexity is a bit-harder to compute. Let A(k, n) 
the complexity of the underlying algorithm used to compute max-tree on 
sub-domains and s = 2 h the number of sub-domains. Map-reduce algorithm 
requires s mapping operations and s — 1 merges. A good map-reduce algo- 
rithm would lead split domain to form a full and complete tree so we assume 
all leaves to be at level h. Merging sub-trees of size n/2 has been analyzed in 

19J and is 0(k log n) (we merge nodes of every fc levels using union-find with- 
out union- by-rank) . Thus, the complexity of the reduction is 0(Wklogn) 



1 a(n), the inverse of Ackermann function, is very low growing, a (10 80 ) ~ 4 



and the complexity S as a function of n and k of the map-reduce is: 
S(k,i) 



2.S(k,%) + kWlogi 
A(k,i) if i < 



Assuming s constant and H = W — y/n this equation solves to: 



n s 



S(k,n) = s.A(k, -) + 0(k.y/n log n logs) = 0(A(k.n)) + 0(ky/n log n) 

s 

When there is as much splits as rows, s in now dependent of n, it leads to 
Matas et al. [5] algorithm whose complexity is: 

S(k,n) = H.W + 0(k.W log nlogH) ~ O(n) + 0(fc % Ai(logn) 2 ) 

Contrary to what they claim, when values are small integers the complexity 
stays linear and is not dominated by merging operations. Finally, canon- 
ization and S reconstruction has a linear time complexity (CanonizeRec is 
called only once for each point) and only uses an image of n elements to 
track points already processed. 



Table 1: Comparison of time complexity of many max-tree algorithms, n is the 
number of pixels and k the number of gray levels. 







Time Complexity 




Algorithm 


Small int 


Large int 


Generic V 


Berger [2] 


0(n log n) 


0(n log n) 


0{n log n) 


Berger + rank 


0(n.a(n)) 


0(n\og logn) 


0(n log n) 


Najman and Couprie 10 


0(n.a(n)) 


0(nlog logn) 


0(n log n) 


Salembier et al. [13] 


0(n.k) 


0(n.k) ~ 0(n 2 ) 


N/A 


Nister and Stewenius [TT] 


0(n.k) 


0(n.k) ~ 0(n 2 ) 


N/A 


Wilkinson [20] 


0(n log n) 


0(n log n) 


0(n log n) 


Salembier non-recursive 


0(n.k) 


0(n log logn) 


0(n log n) 


Menotti et al. [5] (ID) 


0(n) 


O(n) 


O(n) 


Map-reduce 


0{A(k,n)) 


0(A(k,n)) + 


0(A{k,n))+ 






0(ky/nlogn) 


0(ky / n\og n) 


Matas et al. [B] 


0(n) 


0(n) + 0(fcv^(logn) 2 ) 





4 Experiments 

Benchmark were performed on an Intel Core i7 (4 physical cores, 8 logical 
cores). Program have been compiled with gec 4.7, optimization flags on ( — 03 
— march=native). Tests have been conducted on a 6MB 8-bit image. Image has 
been resized by cropping or tiling the original image and over-quantization has 



Table 2: Comparison of space requirements of many max-tree algorithms, n is 
the number of pixels and k the number of gray levels. 

Auxiliary space requirements 



Algorithm 


Small int 


Large int 


Generic V 


Berger et al. [2] 


n + k + stack 


2n + stack 


n + stack 


Berger + rank 


3n + k + stack 


i.n + stack 


3n + stack 


Najman and Couprie |10| 


5n + k + stack 


6n + stack 


5n + stack 


Salembier et al. [13] 


3k + n + stack 


2k + n + stack 


N/A 


Nister and Stewenius [TT] 


2k + 2n 


2k + 2n 


N/A 


Wilkinson [20] 


3n 


3/7, 


3/) 


Salembier non-recursive 


2k + 2n 


3n 


3n 


Menotti et al. [5j (ID) 


k 


n 


n 


Map-reduce 


...+n 


... + n 


...+n 




ber of pixels ( xio 5 ) 




Berger et al. [] 
— Salembier et al. [] 

Najman, Couprie. [] 
x Nister et al. [] 
♦ Berger + level compression - 
-i- Berger + union by rank 
■m Wilkinson [] 



(a) 



(b) 



Fig. 4: (a) Comparison of algorithms on a 8-bit image as a function of size; (b) 
Comparison of algorithms on a 6 Mega-pixels image as a function of quantization. 



been performed by shifting the eight bits left and generating missing lower bits at 
random. |Figure 4] evaluates performance of sequential algorithms w.r.t to image 
size and quantization. A first remark, we notice that all algorithms are linear in 
practice. On natural image, the upper bound nlogn complexity of Wilkinson [20] 
and Berger et al. [2] algorithms is not reached. Let start with union-find based 
algorithms. Berger et al. [5] and Najman and Couprie [TU] have quite the same 
running time (±6% on average), however performances of Najman and Couprie 
[TU] algorithm drops significantly at 256 Mega-pixels. Indeed, at that size each 
auxiliary array/image requires 1 GB memory space, thus Najman and Couprie 
[10] that uses much memory exceeds the 6 GB RAM limit and needs to swap 
with the hard drive. Our implementation of union-by-rank uses less memory and 
is on average 42% faster than Najman and Couprie [10] . Level compression is 
an efficient optimization that provides 35% speed up on average on Berger et al. 



0. However, this optimization is only reliable on low quantized data, figure 4b 



shows that it is relevant until 18 bits. Since it operates on flat-zones, when quan- 




500 1000 1500 2000 

Wall clock time (ms) 



Fig. 5: Time spent in each part of the sequential version of union-find based 
algorithms. 



tization gets higher flat-zones are less probable, thus time saved in findroot 



to find canonical elements does not balance extra tests overheads (see Figure 5 
Union- find is not affected by the quantization but sorting does, counting sort and 
radix sort complexities are respectively linear and logarithmic with the number 
of bits. The break in union-find curves between 18 and 20 bits stands for the 
switch from counting to radix sort. Flooding-based algorithms using hierarchical 
queues outperform our union-find by rank on low quantized image by 41% on 
average. As expected Salembier et al. and Nister and Stewenius [TT] (which 
is the exact non-recursive version of the first one) closely match. However, the 
exponential cost of hierarchical queues w.r.t the number of bits is evident on 
figure | 4b| By using a standard heap instead of hierarchical queues, Wilkinson 
[20] does scale well with the number of bits and outperform every algorithms 
except our implementation of union-by-rank. In |20j . the algorithm is supposed 
to match Salembier et al. |13j's method for low quantized images, but in our 
experiment it stays 4 times slower. As a consequence: 

— since Najman and Couprie [10) 's algorithm is always outperformed by our 
implementation of union-find by rank, it will not be tested any further. 

— [TT] and [201 are merged in our single implementation (called Non-recursive 
Salembier below) that will use hierarchical queues for heap when quantiza- 
tion is below 18 and switches to a standard heap implementation otherwise. 

— the algorithm Berger + level compression will enable level compression only 
when quantization is below 18 bits. 

|Figurc 6| shows results of the map- reduce idiom applied on many algorithms 
and their parallelism. As a first result, we can see best performance are generally 
achieve with 8 threads that is when the number of thread matches the number 
of (logical) cores. However, since there are actually only 4 physical cores, we 
can expect a x4 maximum speed up. Some algorithms take more benefits from 
map- reduce than others. Union- find based are algorithms are particularly well- 



(a) (b) (c) 

Fig. 6: (a,b) Comparison of parallel algorithms on a 6 Mega-pixels 8-bits image 
as a function of number of threads; (a) Wall clock time; (b) Speed up w.r.t 
the sequential version; (c) Comparison of parallel algorithms on a 6 Mega-pixels 
image as a function of quantization. 




Fig. 7: Time spent in each part of parallel versions of algorithms. 



suited for parallelism, union-find with level compression achieves the best speed 
up (x4.2) at 12 threads and union-find by rank a x3.1 speed up with 8 threads. 
More surprising, map-reduce pattern achieves significant speed up even when 
a single thread is used (respectively xl.7 and xl.4 for union- find with level 
compression and union-find by rank) . This result that used to surprise Wilkinson 
et al. |19j as well is explained by a better cache coherence when working on 
sub-domains that balances tree merges overhead. On the other hand, flooding 
algorithms do not scale as well because they are limited by the canonization 
and S reconstruction post-process (that is going to happen also for union-find 
algorithms on architecture with more cores) . In [T^] and [B] , they obtain a speed 
up almost linear with the number of threads because only a parent image is 
built. If we remove canonization and S construction step, we also get those speed 

ups.Q. Figure 6c shows the exponential complexity of merging trees as number FiXme Fatal: Ajouter 

les diagrammes de rep 
des algo en steps 



of bits increases that makes parallel algorithms unsuitable for high quantized 
data. In the light of the previous analysis, |Figure 8] provides guidelines to choose 
the right max-tree algorithm w.r.t to image types and architectures. 



Embedded system ? 
(memory limitation) 




Low 
quantization ? 



Low 
quantization ? 




yes 



Berger et al. [2] 

Berger + level compression 
(± parallelism) 

Salembier et al. [13] or 
Nister and Stewenius [11] 
(± parallelism) 

Berger + rank 



Fig. 8: Decision tree to choose the right max-tree algorithm 



5 Conclusion 

In this paper, we have tried to lead a fair comparison of max-tree algorithms in a 
unique framework. We have highlighted the fact that there is no best algorithm 
that supersedes all the others in every cases and eventually we have given a 
decision tree to choose the right algorithm w.r.t to data and hardware. We have 
proposed a max-tree algorithm using union-by-rank that outperforms the exist- 
ing one from |10j . Furthermore, we have proposed a second one that uses level 
compression for systems with strict memory constraints. As further work, we 
shall include image contents as a new parameter of comparison, for instance im- 
ages with large flat zones (e.g. cartoons) or images having strongly non-uniform 
distribution of gray levels. A code intensively tested used for these benachmarks 
is available on the Internet at http://www.lrde.epita.fr/cgi-bin/twiki/ 
view/Dlena/maxtree. 
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